Paper Note: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Source type: paper | Status: Distilled | Date added: 2026-05-03
Bibliography
- Title: MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
- Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
- Year: 2026
- Venue: arXiv
- arXiv: 2512.06581v4
- URL: https://uii-america.github.io/MedGRPO/
- Local file:
../../raw/papers/MedGRPO:Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding.pdf
Why It Matters
This paper is useful for understanding how reinforcement learning for vision-language models can fail on heterogeneous medical video tasks when reward scales are not balanced. It also provides a concrete recipe for building a medical video instruction benchmark from existing expert annotations.
Reading Notes
- The paper introduces MedVidBench, a 531,850-sample medical video instruction benchmark built from 8 medical video sources and 8 task types.
- The benchmark spans laparoscopic surgery, open surgery, robotic surgery, and nursing procedures.
- Tasks are organized across three granularities:
  - Video-level: video summarization, critical view of safety, next action prediction, skill assessment.
  - Segment-level: temporal action grounding, dense video captioning, region captioning.
  - Frame-level: spatiotemporal grounding.
- The authors transform existing expert annotations into instruction-following QA pairs rather than annotating from scratch.
- Their data pipeline uses source-specific prompting:
  - Bounding boxes and labels are overlaid on frames for densely annotated surgical datasets.
  - Whisper-X transcripts and metadata are used for web-sourced medical videos.
  - GPT-4.1 and Gemini-2.5-Flash generate captions independently for validation.
- Naive GRPO on the heterogeneous dataset collapses because easy datasets produce consistently higher raw rewards than harder datasets.
- MedGRPO fixes this with dataset-task-specific logistic reward normalization centered on each dataset-task median.
- The normalization maps median performance for every dataset-task pair to reward 0.5, reducing bias toward easy datasets.
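The note describes the normalization only at a high level, so the following is a minimal sketch under assumptions: the raw metric is passed through a logistic centered on the dataset-task median and scaled by the IQR (the exact scaling constant and any temperature are not given in the note and are placeholders here).

```python
import math

def normalize_reward(raw, median, iqr, temperature=1.0):
    """Logistic reward normalization centered on the dataset-task median.

    The median raw metric maps to 0.5, so every dataset-task pair is on a
    comparable scale while within-dataset ranking is preserved. IQR scaling
    (a robust spread estimate) reduces sensitivity to outliers. The exact
    constants used by MedGRPO are an assumption in this sketch.
    """
    scale = max(iqr, 1e-8) * temperature  # guard against degenerate spread
    return 1.0 / (1.0 + math.exp(-(raw - median) / scale))

# Per dataset-task statistics would come from SFT baseline percentiles,
# e.g. {(dataset, task): (median, iqr)} -- illustrative numbers only:
stats = {("cholec80", "cvs"): (0.85, 0.10), ("avos", "tag"): (0.20, 0.15)}
r = normalize_reward(0.85, *stats[("cholec80", "cvs")])  # median -> 0.5
```

Because an "easy" dataset's high raw scores sit near its own (high) median, they no longer dominate a hard dataset's lower-but-above-median scores after this mapping.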
- For captioning tasks, semantic similarity is insufficient because clinically important differences can be hidden by high embedding similarity.
- The paper adds a medical LLM judge that evaluates caption quality on five dimensions:
  - Medical terminology precision.
  - Instrument and anatomy identification.
  - Specificity versus vagueness.
  - Clinical procedure context.
  - Action and state accuracy.
- The final caption reward averages normalized semantic similarity and the medical LLM judge score.
- SFT on MedVidBench greatly improves Qwen2.5-VL-7B over off-the-shelf GPT-4.1, Gemini-2.5-Flash, and Qwen2.5-VL-7B baselines.
- MedGRPO further improves the SFT baseline on most evaluated tasks, especially grounding and captioning.
- Removing reward normalization causes catastrophic collapse in the ablation: CVS drops from 0.894 (SFT) to 0.020 and STG from 0.177 to 0.010.
- Training with caption tasks also improves grounding performance, suggesting useful multi-task transfer between descriptive and localization objectives.
- The medical LLM judge is validated against board-certified clinician ratings, with reported Pearson correlation 0.977 and Cohen's Kappa 0.817.
Claims To Distill
- In heterogeneous multi-dataset RL, raw task metrics can create unfair reward scales that bias optimization toward easy datasets.
- Median-centered reward normalization can make dataset-task pairs comparable without erasing within-dataset ranking information.
- Domain-specific evaluation is necessary for medical captioning because general semantic similarity misses instrument, anatomy, action, and spatial precision.
- Multi-task medical video training benefits from connecting captioning and grounding tasks rather than optimizing them in isolation.
- Strong general VLMs still need domain adaptation for medical video understanding, especially for grounding tasks.
Methods And Evidence
- Dataset: MedVidBench, 531,850 video-instruction pairs across 626 videos, 8 medical sources, and 8 task types.
- Model/system: Qwen2.5-VL-7B SFT baseline, followed by MedGRPO reinforcement learning; also evaluated on Qwen3-VL-4B and Qwen3.5-4B variants.
- Reward design:
  - Dataset-task logistic normalization using SFT baseline percentiles.
  - Median performance maps to normalized reward 0.5.
  - IQR scaling reduces outlier sensitivity.
  - Caption tasks combine semantic similarity and medical LLM judge score.
- Evaluation:
  - Accuracy for CVS, NAP, and skill assessment.
  - mIoU for spatiotemporal grounding and temporal action grounding.
  - LLM judge scores and F1 for captioning tasks.
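For reference, the temporal IoU underlying the TAG metric is the standard interval overlap (mIoU averages it over samples; a TAG@0.3 hit requires IoU >= 0.3). A minimal sketch:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals in seconds, as used for
    temporal action grounding. Standard definition, not paper-specific."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0
```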
- Main result:
  - Qwen2.5-VL-7B SFT reaches 0.894 CVS, 0.177 STG, 0.142 TAG@0.3, 3.596 VS LLM, and 2.757 RC LLM.
  - Qwen2.5-VL-7B MedGRPO improves to 0.896 CVS, 0.202 STG, 0.216 TAG@0.3, 4.184 VS LLM, and 3.442 RC LLM.
  - NAP decreases from 0.442 to 0.405 because it was not one of the optimized reward tasks.
Related Work
- GRPO and DAPO-style reinforcement learning for language or vision-language models.
- Medical video datasets such as CholecT50, CholecTrack20, Cholec80-CVS, CoPESD, AVOS, EgoSurgery, JIGSAWS, and NurViD.
- LLM-as-a-judge evaluation for domain-specific caption quality.
- Medical video-language models and instruction tuning.
Follow-ups
- Check whether MedVidBench is publicly downloadable or only described through the project website.
- Inspect the released code to see how the reward normalization statistics are computed and stored.
- Compare this median-centered normalization with z-score, percentile rank, and per-task advantage normalization.
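A toy harness for that comparison, using only stdlib statistics (the median-logistic variant mirrors the sketch above; all three are generic definitions, not the paper's exact implementations):

```python
import math
import statistics

def z_score(x, xs):
    """Mean-centered, stdev-scaled; sensitive to outliers in xs."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return (x - mu) / sd if sd else 0.0

def percentile_rank(x, xs):
    """Fraction of samples at or below x; fully rank-based, in [0, 1]."""
    return sum(v <= x for v in xs) / len(xs)

def median_logistic(x, xs):
    """Median-centered logistic with IQR scale, as in the MedGRPO sketch."""
    med = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return 1.0 / (1.0 + math.exp(-(x - med) / max(q3 - q1, 1e-8)))
```

Running these on per-dataset SFT score samples would show where they differ: z-score is unbounded and outlier-sensitive, percentile rank discards magnitude entirely, and the median-logistic keeps magnitude while staying bounded in (0, 1).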
- Evaluate whether the medical LLM judge prompt can be reused for non-video medical image captioning tasks.